Deoptimization Engine #1419
Conversation
To suppress warnings, add an attribute if possible.

* template/insns_info.inc.tmpl (insn_len): now that this file is included from more than one file, chances are that these static functions can be left unused depending on the compilation unit. In such situations, to prevent compiler warnings, we annotate this function as possibly unused.
* template/insns_info.inc.tmpl (insn_name): ditto.
* template/insns_info.inc.tmpl (insn_op_type): ditto.
* template/insns_info.inc.tmpl (insn_op_types): ditto.
* configure.in (MAYBE_UNUSED): check whether the unused attribute is supported by the compiler.
This is important because signed operations shall not overflow in C. If you want round-trip (wraparound) behavior you can just cast to unsigned then cast back, but if you want to avoid overflows completely, I believe this is the fastest method.

* internal.h (SATURATION_ADD): new macro that adds two variables in saturated integer arithmetic.
* configure.in: check for the __builtin_add_overflow compiler intrinsic, which is available in recent gcc and clang, with an expectedly-optimal implementation of overflow detection.
This is handy when the current insn is auto-generated (e.g. operand unification). This by itself has no runtime overhead.

* tool/instruction.rb (RubyVM::Instruction#sp_increase_macro_expr): new method, to define a macro inside of each insn.
* tool/instruction.rb (RubyVM::VmBodyGenerator#make_header_defines): a bunch of new macros to define, which are (mostly) compile-time constants that characterize each instruction, such as its name.
* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer_undefs): clean up the new macros.
The struct rb_id_table can hold arbitrary VALUE values. Now that this struct is reachable from Ruby's object space, it must understand the requests GC makes of it.

* id_table.c (rb_id_table_mark): new function to avoid GC leaks for id tables. This function applies to all implementations.
* id_table.h (rb_id_table_mark): ditto.
This struct is an ID-to-VALUE mapping, which must be convertible to a Hash.

* id_table.h (rb_id_table_to_h): new function to convert a struct rb_id_table to a Hash. Expected to be used for inspection purposes from inside a ruby script.
* id_table.c (rb_id_table_to_h): ditto.
This implements arbitrary key-value pairs on each ISeq, meant to be used when optimizing. Much like C++11's generalized attributes. Note that, as this is meant for optimization purposes, you can't modify the attributes from a ruby script. The ruby-level API is for debugging purposes.

* vm_core.h (rb_iseq_constant_body): new attributes field to hold arbitrary key-values for any ISeq.
* iseq.c (rb_iseq_annotate): ISeq attribute setter. This is intentionally not exposed to 3rd parties for now, to prevent potentially malicious usage of this feature.
* iseq.c (rb_iseq_annotated_p): obtain a specific attribute of an ISeq.
* iseq.c (iseqw_attributes): obtain the attributes hash of an ISeq.
This variable is expected to be an integer type which can be incremented atomically. Expected to be used where a certain object's "freshness" is vital, e.g. when invalidating a cache.

* vm.c (ruby_vm_global_timestamp): new variable to hold a per-VM timestamp counter.
* vm_insnhelper.h (INC_GLOBAL_TIMESTAMP): macro to increase the timestamp.
* vm_method.c (rb_clear_constant_cache): increase the timestamp.
* vm_method.c (rb_clear_method_cache_by_class): ditto.
* vm.c (rb_next_class_serial): ditto.
This commit adds a new file named deoptimize.c, which basically saves the yet-to-optimize instruction sequence aside, and reverts to that state on occasion. The "state" here means the set of ISeq raw sequence and program counter. In other words, the "optimization" assumes that everything other than those two is shared between the optimized and non-optimized variants. Examples of such info include, for instance, the catch table; so under this infrastructure you cannot yet optimize an iseq which deals with exceptions. That is to be developed later.

There is no corresponding on-the-fly optimizer yet; this commit only adds overhead. If you take a benchmark on this commit it should be (hopefully slightly) slower than the original. Any optimization that utilizes this mechanism must be sufficiently faster than the overhead introduced in this commit.

* deoptimize.h: new file.
* deoptimize.c: new file.
* common.mk (COMMONOBJS): dependencies for new files.
* vm_core.h (rb_iseq_constant_body): new field.
* iseq.c (rb_iseq_free): proper handling of the new field.
* iseq.c (iseq_memsize): ditto.
* vm_insnhelper.c (vm_check_iseq_freshness): new function to check whether the given iseq is stale; if so, deoptimization is kicked off.
* vm_insnhelper.c (vm_push_frame): called from here, right at the beginning of an iseq execution.
* vm_insnhelper.h (CALL_METHOD): also, a method call can result in redefinition(s) of methods, e.g. by requiring something. At a graceful return from a method we shall check the freshness of the current iseq too.
* vm.c (rb_vm_global_timestamp): export this variable.
* iseq.c (rb_iseq_new_with_opt): prepare deoptimization.
This commit adds an on-the-fly ISeq analyzer. It detects an ISeq's purity, i.e. whether that ISeq has side effects or not. Purity is a key concept of optimization techniques in general, but in Ruby it is even more important because there is a method called eval. A pure ISeq is free from eval, while those that are not pure are stuck in the limbo where any of their side effects _could_ result in a (possibly aliased) call to eval. So optimizations tend not to be possible against them.

Note, however, that the analyzer cannot statically say whether the ISeq in question is pure or not. It categorizes an ISeq into 3 states, namely pure, not pure, or "unpredictable". The last category is used when, for instance, there are branches yet to be analyzed, or method calls to another unpredictable ISeq. An ISeq's purity changes over time, not only by redefinition of methods, but by other optimizations; for example, entering a rarely-taken branch of a formerly-unpredictable ISeq kicks the analyzer to fix its purity. Such changes propagate to the ISeq's callers.

* optimize.c: new file.
* optimize.h: new file.
* common.mk (COMMONOBJS): dependencies for new files.
* iseq.h (ISEQ_NEEDS_ANALYZE): new flag to denote that the iseq in question might need (re)analyzing.
Introduces a mechanism to replace a pair of send + pop instructions with the strength-reduced equivalent of adjuststack + nop(s), when that transformation is guaranteed to be possible. When some method redefinition happens to break that assumption, it falls back to the original sequence.

* insns.def (adjuststack): kind of "eats" the immediately preceding instruction to move itself into its place. This is the key part of the whole optimization we implement in this commit.
* insns.def (send): now subject to elimination. When there is a sequence of "send then pop", and that send does not include any side effects, we can safely ignore those two instructions.
* insns.def (opt_send_without_block): ditto.
* insns.def (invokesuper): ditto.
* insns.def (putnil): this insn is already handled in iseq_peephole_optimization() before execution. But as the optimization process goes on during runtime, additional places where this insn can be considered can appear. We would like to wipe them out too.
* insns.def (putself): ditto.
* insns.def (putobject): ditto.
* insns.def (putstring): ditto.
* insns.def (concatstrings): ditto.
* insns.def (freezestring): ditto.
* insns.def (toregexp): ditto.
* insns.def (newarray): ditto.
* insns.def (duparray): ditto.
* insns.def (newhash): ditto.
* insns.def (newrange): ditto.
* compile.c (iseq_peephole_optimize): replace pop with adjuststack 1; the possibility is that this instruction can "eat" the immediately preceding instruction to increase its operand.
* compile.c (REPLACE_ELEM): resurrect, because it now has a client.
* vm.c: add necessary header files.
* common.mk (vm.$(OBJEXT)): update to follow new dependencies.
* optimize.c (wipeout_pattern): pattern buffer, filled with nop. Technically it is interesting because the pattern filled in is a pointer to a function-static variable of another function; it is breaking variable visibility.
* optimize.c (construct_pattern): constructor function to fill in the static pattern.
* optimize.c (iseq_eliminate_insn): fill the pattern to patch the memory region the program counter currently points to. This is where the actual "optimization" takes place.
* vm_core.h (cfp_last_insn): latest instruction's place, length, and stack increase. Note that it does not hold a direct pointer, because deoptimization can change the instruction sequence between the creation of this field and its use.
* vm_core.h (rb_call_cache): new field to hold the "temperature" of the call site this cache represents. This field is explicitly declared signed because we need sub-zero temperatures.
* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): remember the instruction currently running in the CFP, for possible later elimination.
* vm_insnhelper.h (PREPARE_ELIMINATE_SENDISH): a variant, which remembers the current instruction only when it is hot.
* vm_insnhelper.c (vm_eliminate_insn): because this optimization routine resides in a super duper hot path, it is worth inlining some checks before the actual function call is issued. This function is placed here for that reason.
* vm_insnhelper.c (vm_is_hot): check whether it is safe to optimize the given caller site. Note that the safety is tested against what is about to be called, not the currently running ISeq.
* deoptimize.c (iseq_deoptimize): reset the temperatures of each call cache on deoptimization.
* vm_insnhelper.h (inc_temperature): new macro to do a saturated ++.
There are cases where a push push push ... pop pop pop -style sequence appears. If the middle of such a sequence gets optimized, nop and adjuststack can become adjacent. In such situations, moving adjuststack forward to leave the nop behind can trigger further optimization.

* optimize.c (iseq_move_nop): swap a "nop adjuststack" sequence to make room for optimization.
* vm_insnhelper.c (vm_move_nop): ditto.
* insns.def (adjuststack): swap with nop if possible.
This is a very limited version of variable liveness analysis. Only local variables which are never referenced (assigned then made garbage immediately) are detected and eliminated. One _could_ do more if the analysis happened much earlier than it does now. The current approach runs after a considerable amount of execution has already been made using the compiled sequence. But analyzing earlier would impact the execution of evil activities like eval. That is not acceptable as of today, so what is possible is very limited.

* insns.def (setlocal): introduces the new concept of per-iseq temperature. If an ISeq seems to be executed many times (or for a long time), trigger iseq analysis.
* insns.def (getlocal): now subject to elimination.
* insns.def (duparray): ditto.
* insns.def (leave): analyze the leaving iseq.
* optimize.c (iseq_eager_optimize): optimize an ISeq as a whole. Some optimizations become possible right after it is analyzed, and this function is here to do that.
* optimize.c (iseq_analyze): now checks for a local variable's (very restricted) liveness. Can be used for optimizations.
* optimize.c (iseq_analyze_i): rename from iseq_purity, and recursively check child ISeqs.
* optimize.h (iseq_local_variable_is_writeonly): utility function to check whether the local variable at a given index is write-only.
* vm_core.h (rb_iseq_constant_body): new field to hold the per-iseq temperature.
* iseq.h (ISEQ_EAGER_OPTIMIZED): new ISeq flag, denoting whether the ISeq has experienced eager optimization.
* deoptimize.c (iseq_deoptimize): reset the new fields and flags on deoptimization.
This simply replaces an access to a constant with an immediate. This has been possible before; the reason we didn't do it is that a constant can (in spite of its constant-ness) be reassigned. Now that we can revert such optimizations, why not optimize.

* insns.def (getinlinecache): the inline cache is already managing constant validity, so let's just step forward. Replace itself with putobject.
* optimize.c (iseq_const_fold):
* optimize.c (iseq_squash): it is the third time I write this memcpy, so I think it's time to refactor it into a function.
* optimize.c (iseq_move_nop): use iseq_squash().
* optimize.c (iseq_eliminate_insn): ditto.
Integer binary operations are purely mathematical by default. They can be computed beforehand. Here, for each such operation, we check the immediately neighboring instructions and, if it seems to be the case, replace them with the computed value that sits on the stack top.

* optimize.c (iseq_move_nop): generalized so that any instruction can move around.
* vm_insnhelper.c (vm_move_nop): ditto.
* vm_insnhelper.h (MOVE_NOP): ditto.
* insns.def (putobject): subject to moving around.
* insns.def (adjuststack): ditto.
* vm_core.h (rb_control_frame_struct): new field counting the number of consecutive putobject instructions. If this value is sufficiently large, we can consider const-folding them.
* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer): set/reset the new field, depending on the instruction. This assigned memory region is used much, much later, so the assignment is mostly freely reorderable.
* insns.def (opt_constfold): new (quasi-)instruction that is never reached when not optimized. Its purpose is to detect whether constant folding is possible and, if so, replace the previous instructions with a single putobject.
* insns.def (opt_plus): subject to const folding.
* insns.def (opt_minus): ditto.
* insns.def (opt_mult): ditto.
* insns.def (opt_div): ditto.
* insns.def (opt_mod): ditto.
* insns.def (opt_eq): ditto.
* insns.def (opt_neq): ditto.
* insns.def (opt_lt): ditto.
* insns.def (opt_le): ditto.
* insns.def (opt_gt): ditto.
* insns.def (opt_ge): ditto.
I read through this a bit and have a question: can it deopt during a method body? If I'm reading it right, the deopt can only occur immediately before entry into the method, which means off-thread changes won't be seen until you re-enter the optimized method again. Is that correct?
@headius very good point. My answer is yes and no. In this patch deoptimizations can occur at two points:
In short, inter-thread method tampering is not checked explicitly. There are chances for deoptimization during a method's execution, but not immediately when another thread changes something. I think we can detect off-thread changes. For now we have the GVL, so luckily we only need to check right after we acquire the GVL. That should suffice for now. If we decide to give up the GVL, we need to introduce some sort of thread checkpoint instead.
@headius kindly pointed out that we lacked consideration of multiple threads. It is true that some other thread might change something alongside, so we have to check the VM state right after a context switch. Given we have the GVL, this should suffice.

* thread.c (rb_threadptr_execute_interrupts): deoptimize, just in case other threads might have changed something.
* internal.h (rb_vm_global_timestamp): declared.
* common.mk (thread.$(OBJEXT)): add dependency.
Pushed 2f7bfbf. It checks the VM timestamp every time a context switch happens.
Another option could be to have any operation which will cause deoptimisation do the work to deoptimise the affected methods itself, before it transfers control back to them by releasing the GVL. Then the fast path (optimised methods) doesn't have to do a check. If they're still running, then they didn't have to deoptimise. This is an advantage of the GVL. You shouldn't ever have to check whether a method should deoptimise, if whoever causes the deoptimisation does the deoptimisation work itself. It only gets slightly more complicated with parallel threads, where you need to stop the world first.
@chrisseaton true. However we currently don't track which method is optimized in which way, so we would have to start managing that part. Because methods are subject to being GCed, the management structure (if any) would need some sort of weak reference. I think this is a bit too large to include in this pull request. There is another possible way to know which one to deopt: scan all methods every time deoptimization is requested. I doubt that can be done in a timely fashion.
@shyouhei Oh indeed, it should not be difficult to detect that there's been a deoptimization event across threads. The tricky bit will be deoptimizing the currently-executing iseqs without losing ipc, stack state, etc. I don't think that part exists yet, correct?
@headius no. I mean, you are correct. That (avoiding touching VM state) is why we don't see considerable deopt overhead in the above benchmarks. @ko1 commented in person that I will need to juggle VM states sooner or later as I continue to develop this, but not yet in this pull request. It is still just patching iseqs.
Is there anything holding this back atm?
Builds on:

```diff
diff --git a/optimize.c b/optimize.c
index c0fee33..c9e666f 100644
--- a/optimize.c
+++ b/optimize.c
@@ -637,7 +637,7 @@ iseq_analyze(rb_iseq_t *iseq)
     }
 }
-static const VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
+static VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
 static VALUE adjuststack;
 static VALUE nop;
 static VALUE putobject;
```
Is this moving forward? Is there anything we can do to help?
Some suggestions:
I'm seeing errors when running rake's test suite: and errors when running |
I get a different error on "make". I'm building with Clang on Mac.
I thought a cfp was created per method call and would not be reused, but that turned out to be a wrong idea, because sometimes (like in the vm_call_opt_send situation) vm_push_frame is not called. We have to explicitly check that the cfp is not clobbered.

* vm_core.h (cfp_last_insn): new field, to hold the iseq's encoded body.
* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): store the encoded body into the last-insn info.
* vm_insnhelper.c (vm_eliminate_insn): check that the iseq in question is the one stored earlier.
A cache entry can become stale on occasion. That's natural, but the problem is that the cache contents might include pointer(s) to already-reclaimed memory regions. By touching them we face SEGVs, so we have to carefully avoid such situations.

* optimize.c (cc_is_stale): check whether the cache is fresh.
* optimize.c (purity_of_cc): ditto.
* vm.c (ruby_vm_global_method_state): disclose.
This prevents over-optimization. A minimal example to reproduce the situation is:

```
% ruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 putobject_OP_INT2FIX_O_1_C_
0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0010 leave
```

Here, the `jump 6` jumps into the middle of a putobject - putobject - send sequence. But because that sequence is subject to constant folding, once it is replaced by `putobject 2`, address 6 is the literal 2. So it breaks. To prevent this we now introduce a "purposeful" nop right after a branch, and now it looks like:

```
% ./miniruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 nopp
0007 putobject_OP_INT2FIX_O_1_C_
0008 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0011 leave
```

* insns.def (nopp): new instruction.
* compile.c (iseq_compile_each): insert the new nopp instruction right after a NODE_IF compilation finishes, to prevent the constant-folding situation.
* optimize.c (purity_of_insn): take care of the new instruction.
OK, after close inspection I found several bugs and fixed them all. It now bundles and runs Rails 5.0.0.1, at least for me.
How come the AppVeyor test results aren't public?
Still doesn't compile on Mac with Clang. I have changes locally that allow it to build, but it segfaults. I've attached the "git diff". I'll try to get it to build and run successfully.
Hmm. Thank you. I've not tested this against clang so far. Will take a look.
Before this we prepared every iseq. Koichi Sasada pointed out this would be a problem when there are lots of non-optimizable sequences. In order to route around this we delay allocation of deoptimized sequences until we actually touch the sequence.

* iseq.c (rb_iseq_new_with_opt): delay allocating the deoptimized sequence.
* optimize.c (iseq_squash): prepare it, in case this is the first time we optimize this target.
* optimize.c (iseq_eager_optimize): ditto.
* deoptimize.h (iseq_deoptimize_if_needed): take virgin iseqs into consideration.
Here is the Make output when I compile. EDIT: Link to gist rather than dumping all the output in here.
Any status on this one? It's almost one year old :)
Interested in this one as well. Any news?
Me also.
@pedroaliens @shyouhei any update on this issue?
Folks, piling on doesn't help expedite an issue.
Hi, a quick update: it's still work in progress. I got comments from other core devs including @ko1, so I think I need his review to merge this, once I finish this peaceful update. Thank you for your interest!
I successfully made and installed this, but when I tried running bundle install, I got a segfault:
Hello, this PR has now been here for 2 years. Any news?
This pull request was a bit too big for other core devs to review. I changed my strategy to push things forward. As you can see above, several parts of this pull request were separated out and proposed on their own. Some of them have already been merged, gaining ~1% speedup. Right now I am preparing another single-cut branch which brings some more boost. It is yet to be pull-requested because it fails some tests right now. I hope it can be submitted soon.
Like I wrote above, we are not going to merge this pull request as-is. So closing now.
Abstract
Implemented a way to optimize ruby's execution and to revert those optimizations. The strategy is restricted so that VM states like the program counter(s) are not affected by the modifications. This restriction makes deoptimization very lightweight. Experiments show that optimizations on this mechanism speed up micro benchmarks, yet have no significant overhead for pitfall-ish activities like eval.
Introduction
Ruby is said to be slow[1]. While there can be arguments on this, experiments show that it is at least not the fastest language to execute[2]. …
The reason behind this is variously attributed to its GC being slow, or its GVL, or its dynamic nature. But the author would like to point out the most important thing that Ruby lacks: it does not even try to optimize its execution. We believe this is the root cause of the slowness we suffer.
We wrote "try to" above. This is the essential concept of this pull request. The reason Ruby did not optimize so far was that its execution might differ on occasion. For instance, `1 + 2` might not always be `3`, because `+` can be redefined on the fly. So the interpreter does not constant-fold the expression and looks up `Integer#+` every time.

This redefinition, however, rarely happens. Almost every time `1 + 2` is evaluated, its value is arguably `3`. Given this empirical assessment, it makes sense for Ruby to first try `3` and, if `Integer#+` was redefined for any reason, fall back to the full lookup. That should be much faster for the vast majority of normal (not redefined) cases, while it does not break redefinition, at the cost of some overhead.

This sort of technique is called deoptimization[3]. Deoptimization is done, more or less, in other languages / implementations. Notably, JVM-related projects have much research in this area. For instance, JRuby+Truffle already has rich applications of this technique[4].
What is pull-requested here is not a direct C translation of JRuby+Truffle's approach. Let me describe what is going on.
The deoptimization strategy
Since 1.9, Ruby's interpreter has been equipped with a language VM (visible as `::RubyVM` from inside ruby scripts). It first generates sequences ("iseq"s hereafter) of VM instructions ("insn"s hereafter), and executes them in series. The key structure of this mechanism is called `struct rb_iseq_constant_body`.

The `iseq_encoded` field is what the VM scans to execute. This is where we want to optimize. Prior to any modifications, we take a backup of it using `memcpy`. In case of a deoptimization event, we just overwrite this memory region with the backed-up sequence using `memcpy`, to revert whatever optimizations might have been applied to it.
to revert whatever optimizations that might have been applied to it.The upside of this strategy is that it is fairly simple. Because everything is written in pure C code we have no portability concerns. Also it is so simple that the deoptimization does not involve the VM states at all, including program counters. This has advantage when execution of instructions recurs; if we had to update program counters, we must scan the whole VM stacks to keep consistencies between before / after deoptimization process.
The downside of this strategy is that it restricts what can be optimized. For instance we cannot reorder instructions, because exceptions might be raised inside the reordered sequences. We cannot change the length of an iseq, because the underlying memory region cannot be reallocated.
However, even with these restrictions, we can do something.
Implemented optimizations
We have the following optimizations in this pull request.
Constant folding
For instance a `1 + 2` is optimized into `3` like this. The `nop`s are introduced to meet the "no PC modification" restriction. In order to implement this, the `opt_plus` instruction is modified to detect a `putobject`-`putobject`-`opt_plus` sequence. It then calculates the computation as usual, and overwrites itself with a sequence like the above diff.

Ruby-level constants (like `::Object`) are also subject to folding. They are resolved and replaced with a `putobject` of the identical value.

send elimination

A typical ruby script consists of many method invocations. They are represented as `send` or its variant instructions in an iseq. If they are optimized, the effect should not be marginal. The problem is that a method call is not always optimizable. We have to determine whether we can skip the call.
or its variant instructions in an iseq. If they are optimized, the effect should not be marginal. The problem is a method call is not always optimizable. We have to determine whether we can skip calling.In order to do so a method is called as usual for the first time, and checked whether it has any side effects or not. If a method does not contain any side effects, it (more precisely, its underlying iseq) is marked as being "pure". Next time when the same method is called, we can consider eliminating the call to it.
However, a method that is not marked pure is not always ineligible for optimization. A method can of course call another method. A method that is not pure may simply not have been checked yet. So a method has 3 states, namely "pure", "not pure", and "unpredictable". Every method starts its life as unpredictable. Once the methods it calls are all predicted to be pure, and it itself has no side effects, its state is updated to pure.
Side effects that we care about include:
Once we know a method is pure, it is safe to eliminate a call to that method when its return value is immediately discarded. Note however, that even in such situations, the evaluation of the method's arguments is not always eligible for elimination. The arguments can have their own side effects. We just eliminate the `send` variant instruction, and not its caller-site setup, like the following diff:

This is how a `m(n())` is optimized, where method `m` is pure but `n` is not.

Elimination of variable assignments
As we have seen, method calls cannot be eliminated when their return value is used. To relax this, we would like to skip assignments to variables where possible.

However, this is not easy to do precisely. A variable's liveness must be analyzed. All types of variables except local ones are almost impossible to analyze. Also, because of the restrictions imposed by the deoptimization process, massive iseq rearrangements like SSA conversion are not doable.
In this pull request a very limited version of variable liveness analysis is implemented. Only local variables which are never referenced (assigned then made garbage immediately) are detected, then eliminated if possible. This is not the only thing that could be done theoretically, but it is practically the easiest to detect, so we do this as a bridgehead to further optimizations.

Note we have to check iseqs recursively: because local variables are reachable from inside a block, and a block is represented as a separate iseq, we have to check all possible iseqs that can reach a local variable to ensure no references to it exist.
After checking for usages, a useless assignment can be eliminated like this:
Experiments
To determine the effectiveness of this approach we ran experiments on my environment. This machine has an `Intel(R) Core(TM)2 Duo CPU T7700` CPU and runs Ubuntu 16.04. The image below is a `make benchmark` result against ruby 2.4.0dev.

Looking at the benchmark, the result is (generally speaking) similar to 2.4, with several exceptional cases that speed up. The most improved, `vm2_bigarray*`, got a huge speedup because the generation of a big array is eliminated entirely. The other improved benchmarks show that the optimizations described above work quite well.

It is worth noting that the `vm2_eval*` result is 0.933, i.e. `eval` is only 2.7% slower compared to no optimization. This can be considered the deoptimization preparation cost, and it means the overhead is quite small.

Conclusions
Implemented a deoptimization engine for ruby, and several optimizations on top of it. The proposed deoptimization strategy is lightweight, so activities like `eval` work without huge overheads.

The author is not willing to argue that this is the best optimization technique we know of. Rather, even if it is suboptimal, the benchmark results show that the optimizations proposed here gain a considerable amount of speedup. This is why ruby is said to be slow: it is not optimized at all. Any trivial optimization should significantly impact it.
We believe ruby should have optimizations implemented.
Future works
More optimizations are conceivable on top of this deoptimization engine. For instance, branches with constant expressions could be relaxed into unconditional jumps. More analysis of variable liveness could eliminate more of them. Conversely, we could think of adding new local variables, e.g. to eliminate common subexpressions.
The deoptimization strategy itself is subject to improvement. If we can tolerate more slowdown of `eval`, a more fine-grained setup like three-address code could be considered. To do so, the deoptimization engine would have to be able to revert IR conversion.